A genomic data archive from the Network for Pancreatic Organ donors with Diabetes

The Network for Pancreatic Organ donors with Diabetes (nPOD) is the largest biorepository of human pancreata and associated immune organs from donors with type 1 diabetes (T1D), maturity-onset diabetes of the young (MODY), cystic fibrosis-related diabetes (CFRD), type 2 diabetes (T2D), gestational diabetes, islet autoantibody positivity (AAb+), and without diabetes. nPOD recovers, processes, analyzes, and distributes high-quality biospecimens, collected using optimized standard operating procedures, and associated de-identified data/metadata to researchers around the world. Herein describes the release of high-parameter genotyping data from this collection. 372 donors were genotyped using a custom precision medicine single nucleotide polymorphism (SNP) microarray. Data were technically validated using published algorithms to evaluate donor relatedness, ancestry, imputed HLA, and T1D genetic risk score. Additionally, 207 donors were assessed for rare known and novel coding region variants via whole exome sequencing (WES). These data are publicly-available to enable genotype-specific sample requests and the study of novel genotype:phenotype associations, aiding in the mission of nPOD to enhance understanding of diabetes pathogenesis to promote the development of novel therapies.

These genotyping data have been generated and made accessible to enable genotype-selected sample requests and the study of novel genotype:phenotype associations by the international community of nPOD investigators. We anticipate that the diversity of nPOD donor genetics may be partly responsible for inter-donor heterogeneity observed in islet health, insulitis composition, age at T1D onset, islet AAb status, and other endotype-related characteristics [23][24][25] . Importantly, beyond explaining diabetes heterogeneity, the findings facilitated by these data are expected to inform precision medicine strategies for prevention or suspension of the pathogenesis of T1D as well as other forms of diabetes.

Methods
Donor tissues. Transplant-quality organs, including pancreas and up to 13 other tissues, were recovered from cadaveric organ donors by United States (U.S.) organ procurement organizations (OPOs, http://www. jdrfnpod.org//for-partners/npod-partners/, accessed October 15, 2021) in accordance with federal guidelines, then processed by the nPOD Organ Processing and Pathology Core (OPPC) according to University of Florida (UF) Institutional Review Board (IRB) approved protocol IRB201600029, as previously described 8,11 . Studies conducted using organ donor tissue samples from the nPOD biobank are classified as minimum risk research, as study participants are no longer living. However, informed consent for research participation is obtained from family members via both written and verbal communication prior to organ donation, with the consent processes undertaken by qualified personnel affiliated with the U.S. OPO network. All subject information is de-identified in accordance with HIPAA regulations. For each donor, clinical and demographic information, were obtained via medical chart review and OPO-conducted interview with the donor's family. High-resolution four-digit HLA typing was performed by Next Generation Sequencing (NGS) as previously described 8,12 at the Barbara Davis Center for Childhood Diabetes HLA Core (University of Colorado Anschutz Medical Campus). nPOD donors were categorized by diabetes type, verified by UF endocrinologist review of the de-identified terminal medical records (including diagnosis and duration of diabetes, history or clinical data for diabetic ketoacidosis, medications, and BMI), donor metadata (e.g., age, sex, reported race and ethnicity), and additional data (serum C-peptide levels, islet AAb status 10 , hemoglobin A1c [HbA1c], and high-resolution HLA 8,12 ). Unique research resource identifiers (RRIDs) were assigned to each organ donor, in order to facilitate the provenance and reproducibility of results 26 .  (Fig. 1b). The base array is the Axiom TM Precision Medicine Research Array (Thermo Fisher Scientific), to which all content from the ImmunoChip v2 28 was added, as well as all previously reported credible T1D risk variants 3 (Fig. 2, UFDIchip_library_file.xlsx 27 ). The array also includes dense coverage of the highly polymorphic HLA region, which allows for accurate imputation of HLA haplotypes to 4-digit resolution.
Genotype processing and analysis. UFDIchip plates were processed on an Affymetrix GeneTitan instrument with external sample handling on a BioMek FX dual arm robotic workstation. Axiom ™ Analysis Suite software (v3.0, Thermo Fisher Scientific) was used to process raw CEL file data to plink text files. The software includes quality control (QC) procedures at the sample, plate, and SNP levels. These QC threshold parameters were set to Axiom ™ Analysis Suite default stringency ("Best Practices Workflow" using "Human.legacy.v5" settings A screen for discordance from reported sex via X chromosome heterozygosity was then performed using plink v1.9 29 . All data passed these QC screens and raw CEL files and a binary plink file containing processed data (GRCh37/hg19) from all cases are stored in the database of Genotypes and Phenotypes (dbGaP) 27 . Subsequent analyses included relatedness estimation using KING 30 , genetic ancestry imputation using ADMIXTURE 18   www.nature.com/scientificdata www.nature.com/scientificdata/ Relatedness. Genotyping data from the nPOD cohort with unknown and from the 1000 Genomes phase 3 cohort 32 with known family relationships were merged and analyzed for genetic relatedness using KING 30 (v2.1.2) software. The integrated relationship inference command was used to infer up to third-degree relatives. Relationships between nPOD case pairs and between 1000 Genomes pairs were represented by plotting estimated kinship coefficients. Kinship coefficients of unrelated 1000 Genomes pairs were randomly downsampled to the number of nPOD subject pairs to allow for data visualization.
Genetic ancestry. Data from unrelated subjects from the 1000 Genomes phase 3 cohort 32 were filtered for SNPs that overlap with the UFDIchip array using plink v1.9 29 . The data were pruned for linkage disequilibrium (LD) by removing SNPs with R 2 > 0.1, screening within a 50 SNP block and proceeding by steps of 10 SNPs. This yielded 1000 Genomes genotypes for 320,005 SNPs, which were used to run an unsupervised analysis using ADMIXTURE software 18  Imputation accuracy for each of these loci [Acc(L)] was calculated as previously reported 33 , substituting the dosage for the probability score that is provided by Axiom TM HLA Analysis Software 19 : where P i is the probability for imputed alleles A1 i,L and A2 i,L for donor i at locus L. Imputed alleles were considered concordant when they were included in the donor's set of typed alleles at locus L, and discordant when they were not in the set of typed alleles at locus L. For discordant alleles, P i was set to 0. The summation of probabilities for the total number of donors assessed, n, was then divided by the total number of alleles tested, 2n. The accuracy score ranges from 0, for no concordant calls, to 1, for complete concordance with probabilities of 1 for all alleles.
Concordance was calculated at the 2-digit and 4-digit level for genotypes related to T1D risk or protection, as determined in primarily White cohorts [34][35][36] where the number of imputed alleles A1 i,L and A2 i,L matching the genotype of interest for donor i at locus L was summed across all donors, n, and divided by the number of typed alleles A1 t,L and A2 t,L matching the genotype of interest for donor i at locus L summed across all donors. The accuracy score ranges from 0, for no concordant calls, to 1, for complete concordance. Donor-level imputation accuracy [Acc(S)] was calculated as: www.nature.com/scientificdata www.nature.com/scientificdata/ where P j is the probability for imputed alleles A1 j,S and A2 j,S at each HLA locus j of donor S. Concordance was determined as described above, and P j was set to 0 for discordant alleles. The total number of loci tested, n, was 4 per donor (HLA-A, HLA-DRB1, HLA-DQA1, and HLA-DQB1). The accuracy score ranges from 0, for no concordant calls, to 1, for complete concordance with probabilities of 1 for both alleles at each locus for donor S.

T1D GRS calculation.
We computed polygenic T1D genetic risk scores, referred to as GRS1 21,22 , GRS2 37 , and African-Ancestry (AA)-GRS 38 . GRS1 is calculated using dosages of risk genotypes for 30 T1D-associated SNPs 21 . Genotypes were obtained by imputing to the TOPMed (v r2) 31 panel (R 2 > 0.97). rs2187668 was not found in TOPMed, thus, a suitable proxy SNP from GRS2 37 , rs9273369, was used instead. The HLA component of GRS1 was calculated using the Polygenic Risk Score (PRS) Toolkit for HLA (v0.22a) developed by Sharp et al. 37 . The non-HLA component of GRS1 was then calculated via weighted sum, using odds ratios from Oram et al. 21 . The HLA and non-HLA scores were summed and normalized as described in Oram et al. 21 . GRS2 is calculated using dosages of risk genotypes for 67 T1D-associated SNPs 37 . Genotypes were obtained by imputing to the TOPMed (v r2) 31 panel (R 2 > 0.97). rs2476601, rs1281934, rs9273342, rs9271346, rs1233320, rs16822632, rs116522341, rs559242105, and rs371250843 were not found in TOPMed, thus, suitable proxy SNPs rs6679677, rs1281943, rs9273032, rs9271347, rs1233320, rs17840116, rs9268500, rs3129197, and rs9266268 were respectively used instead. The HLA component of GRS2 was calculated using the PRS Toolkit for HLA (v0.22a) developed by Sharp et al. 37 . The non-HLA component of GRS2 was then calculated via weighted sum, using odds ratios from Sharp et al. 37 and added to the HLA component. AA-GRS is calculated using dosages of risk genotypes for 7 T1D-associated SNPs 38 . Genotypes were obtained by imputing to the TOPMed (v r2) 31 panel (R 2 > 0.96). rs2187668 and rs34303755 were not found in TOPMed; thus, suitable proxy SNPs rs9273369 and rs9268838 were respectively used instead. The AA-GRS was then calculated via weighted sum, using odds ratios from Onengut-Gumuscu et al. 38 . Table 2, Phenotype_data.txt 27 ), WES libraries were generated as previously described 39 (Fig. 1c) using the Agilent SureSelect Human All Exon kit (Agilent Technologies, CA, USA). Procedures and quality control (QC) measures were performed following manufacturer's recommendations. Briefly, 180-280 bp fragments were generated from genomic DNA by sonication (Covaris) with exonuclease and polymerase subsequently utilized to convert remaining overhangs into blunt ends. The DNA fragments were adenylated on the 3′ ends followed by ligation of adapter oligonucleotides. Successfully ligated DNA fragments were enriched by PCR. Following hybridization with biotin-labelled probes, exons were captured with streptavidin-coated magnetic beads. After a wash, probes were digested. Libraries were enriched and index tags added by PCR. Amplified exon libraries were purified using AMPure XP (Beckman Coulter), quantified by   www.nature.com/scientificdata www.nature.com/scientificdata/ in both assays and with at least one minor allele count (MAC) in the WES data were filtered using plink v1.9 29 , resulting in 27,852 variants for comparison. Per-SNP intra-assay concordance levels were calculated across all subjects.

Data Records
UFDIchip array data are stored in dbGaP 27 as raw CEL files and compiled processed data from all donors deposited as binary plink files (hg19). All genotyped donors, as well as their age, sex, reported race, diabetes status and duration, are provided in Phenotype_data.txt 27 with additional donor information available on the nPOD Data Portal (https://portal.jdrfnpod.org/, accessed October 21, 2022).
WES data are stored in dbGaP 27 , including raw exome sequencing data files (fastq format) or hg19 aligned exome sequencing data (bam format), in addition to processed variant files (vcf format). A spreadsheet listing variants and associated annotations per donor, as described in the methods, was also submitted (csv format). All donors subjected to WES, as well as their age, sex, reported race, diabetes status and duration, are listed in Phenotype_data.txt 27 with additional donor information available on the nPOD Data Portal (https://portal. jdrfnpod.org/, accessed October 21, 2022).  Table 3. Imputed HLA concordance with typed HLA for T1D-associated genotypes. For the 372 nPOD donors evaluated, the number of alleles (n) and concordance rate are displayed for donors carrying specified genotypes associated with risk or protection from T1D, as determined by high-resolution four-digit HLA typing by next generation sequencing (NGS) 8,12 . † DRB1*04:03-DQ8 is protective; ‡ DQ7 (DQA1*03:01-DQB1*03:01) is protective in DR4 haplotypes; # Calculations assume that typed HLA is accurate over imputed HLA.
www.nature.com/scientificdata www.nature.com/scientificdata/ technical Validation Quality control assessment of the UFDIchip genotype array. As of this report, 372 nPOD donors have been genotyped on the UFDIchip and the resulting data are accessible on dbGaP (see Data Records). All array results were subjected to basic QC analyses that assessed donor-level DQC; donor-, plate-, and SNP-level call rate; and sex concordance. Donor DQC or call rate failures were re-processed with freshly extracted DNA when necessary. nPOD samples were batch-processed with data from living donors 20 to facilitate calling of low-frequency variants 51 , resulting in 942,466 high quality genotypes passing the SNP call rate threshold. The nPOD cohort demonstrated SNP call rates of 99. 58 [99.19-99.84] (median [interquartile range (IQR)]). All nPOD samples were assessed for concordance between reported and imputed sex according to level of X chromosome heterozygosity using plink v1.9 29 . For all nPOD cases, imputed sex matched reported sex. Thus, all 372 nPOD samples passed basic QC measures. Additionally, 24 nPOD samples were run in technical replicate to assess assay reproducibility. Call rates between the technical replicates differed minimally, with 0.087 ± 0.640% bias (mean ± standard deviation, Fig. 3a). Importantly, the kinship coefficients between the 24 technical replicates were 0.499 [0.496-0.499] (median [IQR]), suggesting near identical genotype calls (Fig. 3b).
Relatedness estimation. Next, relatedness between donors was assessed. Due to the nature of donor organ procurement, it is highly improbable, although not impossible, that nPOD donors may be related. A relatedness  www.nature.com/scientificdata www.nature.com/scientificdata/ analysis of the 372 nPOD donors (69,006 possible pair combinations) using KING software 30 found that all of these donor pairs were inferred to be unrelated (>third-degree relatives). For comparison, we also assessed the relatedness of 2,504 1000 Genomes phase 3 cohort 32 subjects. While this set was designed to consist of unrelated individuals, it unintentionally included a few first-, second-, and third-degree relatives 32 . When relatedness between nPOD donor pairs was compared to relatedness between 1000 Genomes 32 subject pairs, nPOD donor pairs showed significantly smaller kinship coefficients than inferred parent-offspring (PO), full sibling (FS), 2 nd degree relative, and 3 rd degree relative pairs from 1000 Genomes (Fig. 3b), suggesting that nPOD donors are not closely related. Note that nPOD donor pairs had significantly larger kinship coefficients than inferred unrelated (UN) pairs from 1000 Genomes 32 (Fig. 3b), potentially due to increased similarity in genetic ancestry 52 between subjects in the nPOD cohort than in the 1000 Genomes cohort, which was specifically designed to sample individuals with diverse genetic ancestry. Beyond confirming expected relatedness in the nPOD cohort, this validates that users of this resource may employ population-level quantitative trait locus (QTL) analysis methods with these genetic data. alignment with genetic ancestry. To further validate the UFDIchip data, we used the 1000 Genomes phase 3 cohort 32 to build a reference model for genetic ancestry using ADMIXTURE software 18 (Fig. 4a,b), projected all 372 nPOD donors onto this model to impute ancestry (Fig. 4c), and compared those results with reported race. Using methods modified from Kaddis, et al. 53 , we plotted PCA results of ancestry proportions and observed that each of the five major continental populations in the 1000 Genomes cohort (AFR, AMR, EAS, EUR, and SAS) 32 clustered to occupy distinct PC space (Fig. 4b). This suggested that the five ancestry populations computed by ADMIXTURE were representative of the five continental populations from 1000 Genomes 32 . The ancestry proportions of 1000 Genomes 32 continental populations were almost entirely represented by a single ancestry group, with the exception of admixed populations including Admixed Americans (AMR), as well as the subcontinental populations, Americans of African ancestry in SW USA (ASW) and African Caribbeans in Barbados (ACB, Fig. 4a,b), as previously observed 32 . Next, the nPOD cohort was projected onto the 1000 Genomes reference, revealing overlap with AFR, AMR, EAS, and EUR groups in PC space (Fig. 4c). Donors were then assessed for agreement between reported race and genetic ancestry, showing that the highest AFR, AMR, EAS, and EUR ancestry proportions were observed in donors reported as Black/African American, Hispanic/ Latinx, Asian, and White/Caucasian respectively (UFDIchip_admixture.xlsx 27 ), which is consistent with other U.S.-based admixture studies 54,55 . Notably, racial identity is complex and the method of estimating proportions of continental genetic ancestries may not adequately reflect genetic diversity 56 . With this limitation in mind, these analyses accomplish the aims of: 1) ADMIXTURE model validation using UFDIchip array data and 2) qualification of the genetic ancestry results as an alternative or additional covariate to reported race for users of this resource (UFDIchip_admixture.xlsx 27 ) 57 .
HLA imputation accuracy and concordance. The nPOD cohort was HLA typed using next generation sequencing (NGS) at HLA-A, HLA-DRB1, HLA-DQA1, and HLA-DQB1 8 to identify donors with genotypes that are associated with T1D risk or protection [34][35][36] . This enables an extra level of QC and validation of the UFDIchip array data by comparing typed to genetically imputed HLA genotypes. Imputation accuracy at each locus, Acc(L), was calculated assuming typed results were correct if discordant with imputed results. Overall, Acc(L) was >0.93 for low-resolution HLA (2-digit) and >0.90 for high-resolution HLA (4-digit) for the four loci tested (UFDIchip_HLA_imputation_accuracy.xlsx 27 ).
Next, we assessed concordance between typed and imputed HLA for T1D risk (A2, A24, DR3, DQ2, DR4, DQ8, DR8, and DQ4) or protective (DR15, DQ6, and DQ7) genotypes [34][35][36] (Table 3). At 2-digit resolution, all tested loci were greater than 93% concordant (median [IQR]: 98.5% [97.4%-99.8%], Table 3). At 4-digit resolution, HLA concordance was predictably lower (median [IQR]: 97.1% [92.5%-99.3%]), with notable discordance in the less common HLA-DRB1*04:xx genotypes (Table 3). Importantly, 4-digit genotypes that convert 2-digit risk to protective genotypes, such as HLA-DRB1*04:03 and HLA-DQB1*03:01, were accurately imputed with greater than 97.9% concordance (Table 3). Data validation at the sample level was assessed using a sample imputation accuracy score, Acc(S), for 2-digit HLA at the four typed loci. Acc(S) was 0.984 [0.946-0.998] (median [IQR]), indicating high performance of the imputation methodology per sample (Fig. 5a). HLA imputation may be inaccurate when rare HLA genotypes and ancestrally diverse populations are underrepresented in the reference cohort 33,58,59 . In agreement with this notion, a breakdown of nPOD donors by reported race or by top genetic ancestry proportion suggests that  www.nature.com/scientificdata www.nature.com/scientificdata/ imputation accuracy could potentially be improved with greater reference cohort diversity (Fig. 5b,c). Donors with reported race of White had significantly higher HLA imputation accuracy than those reported as Black or Hispanic/Latinx (Fig. 5b). Similarly, donors whose highest genetic ancestry proportion were EUR had higher imputation accuracy than donors whose were AFR or AMR (Fig. 5c). Notably, 4-digit HLA imputation showed 100% concordance for the 24 nPOD subjects run in technical replicate on the UFDIchip. T1D polygenic GRS performance using UFDIchip data. Polygenic risk scores summarize genetic risk for T1D as a continuous value by aggregating estimated risk at HLA and non-HLA loci 21,22,37,38 . The original reports of GRS1 described its utility for discerning T1D from other forms of diabetes including T2D 21,60 and MODY 22 . We previously observed that a similar GRS robustly discriminated living controls from T1D subjects reported as White but was less effective for non-White subjects, highlighting a need for diversity in risk modeling 20,53 . Shortly thereafter, GRS2 was developed to incorporate the impact of interactions between HLA haplotypes on T1D risk, showing improved discrimination of European ancestry T1D from control subjects 37 . Additionally, an AA-GRS was created to account for ancestry-specific T1D risk loci, with enhanced performance at distinguishing T1D from control subjects in AFR populations 38 . We therefore attempted to validate these previous findings regarding the ability of GRS1, GRS2, and AA-GRS to differentiate controls from T1D subjects by using the 372 nPOD cases subjected to genotyping. Indeed, White T1D Table 4 (Fig. 6c). Taken together, these results indicate that the nPOD cohort UFDIchip array data represent a validated resource for genetic studies of T1D. Additionally, we provide GRS1, GRS2, and AA-GRS genotypes and calculated scores to the community for future studies (GRS1_ GRS2_AAGRS_TOPMed_Imputed.xlsx 27 ). Note that these scores differ from those provided in Kaddis, et al. 53 , due to updating the reference cohort for imputation from the Haplotype Reference Consortium (HRC) 61 cohort, with predominantly European ancestry, to the TOPMed 31 reference, with diverse genetic ancestry.
WES. 207 nPOD donors were also queried for rare variants using WES and associated data are accessible on dbGaP (see Data Records). Standard QC measures were performed to minimize adapter contamination, low-quality reads, error rate, and sequencing bias. To further validate data quality, we measured concordance between genotype calls from the UFDIchip and WES (N = 167 donors) Six of the nPOD donors were previously reported to have genetic variants with possible clinical impact in KCNJ11, LMNA, HNF1A, GLIS3, INSR, and GATA6 using a custom-designed NGS panel that included 140 genes implicated in monogenic diabetes 62 . These genetic variants were validated with WES (Table 4) and visual exploration of the data using the Integrative Genomics Viewer (IGV) 63 confirmed reads for each variant (Fig. 7). WES captures genomic DNA sequence in exons and the intronic sequence adjacent to exons. This enables the discovery of variants that directly alter the protein coding portion of mRNA (missense, nonsense, insertion/ deletions) and also some regulatory intronic sequences, such as splice sites. Variants in genes associated with autoimmune diabetes (AIRE, FOXP3, IL2RA, ITCH, LRBA, SKAP2, STAT1, and STAT3) or MODY and neonatal diabetes (GCK, HNF1A, HNF4A, HNF1B, ABCC8, KCNJ11, and INS) 39,[64][65][66] were observed in 141 nPOD cases with T1D (Fig. 8a). Monogenic forms of diabetes are rare, and the vast majority of the detected variants are not expected to have functional or clinical consequences. www.nature.com/scientificdata www.nature.com/scientificdata/  Note that functional studies are needed for potential monogenic diabetes variants that have not already been previously validated.  www.nature.com/scientificdata www.nature.com/scientificdata/ There are several databases and tools available to help with the identification and interpretation of genetic variants. For example, the frequency of a variant in the general population can be estimated using the gno-mAD, which contains data from 140,000 + exomes and genomes from unique, unrelated individuals spanning six global ancestries 47 . Additionally, the CADD score can be used to predict severity of impact of the variants based on a variety of criteria such as sequence context, evolutionary constraint, and functional predictions 46 . As expected, the variants observed in T1D cases were distributed across a spectrum of functional classes, with only a few predicted to be both rare (frequency < 0.01%) and deleterious (CADD score ≥ 20, Fig. 8b-d). As an example of how these tools can be used, variants in the monogenic diabetes genes HNF1A and STAT1 were analysed in the nPOD donors classified as T1D. One variant for each gene was predicted to be rare and deleterious based on the thresholds set for the gnomAD frequency and CADD score (Fig. 9, Table 5). The thresholds set for these and other bioinformatic tools are determined by each investigator, and are often informed by the clinical phenotype of the patient and previous knowledge about the gene's disease association. Other variant annotations from tools including ACMG Classification 43 , SIFT Function Prediction 44 , PolyPhen-2 Function Prediction 45 , HGMD 48 , ClinVar 49 , and CentoMD 50 are available for all 207 nPOD donors on dbGaP (see Data Records). A suggested workflow for evaluating genetic variants for potential clinical significance is shown in Fig. 10. Importantly, while computational tools facilitate interpretation, confidence in the functional or clinical relevance of the genetic variants reported herein requires rigorous experimentation.

Usage Notes
The associated data are openly available with unrestricted access.

Code availability
No custom code or scripts were used for the curation and validation of the dataset.